Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey

نویسندگان

چکیده

Abstract With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT), generative transformers (GPT), etc. Inspired by success of these in single domains (like computer and natural language processing), multi-modal have also drawn more attention recent years. In this work, we give a comprehensive survey hope paper could provide new insights helps fresh researchers to track most cutting-edge works. Specifically, firstly introduce background pre-training reviewing conventional learning, works process, vision, speech. Then, task definition, key challenges, advantages (MM-PTMs), discuss MM-PTMs with focus on data, objectives, network architectures, knowledge enhanced pre-training. After that, downstream tasks used validation large-scale MM-PTMs, including generative, classification, regression tasks. We visualization analysis model parameters results representative Finally, point out possible research directions topic that may benefit future addition, maintain continuously updated list models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey .

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Modal Consistency based Pre-Trained Multi-Model Reuse

Multi-Model Reuse is one of the prominent problems in Learnware [Zhou, 2016] framework, while the main issue of Multi-Model Reuse lies in the final prediction acquisition from the responses of multiple pre-trained models. Different from multiclassifiers ensemble, there are only pre-trained models rather than the whole training sets provided in Multi-Model Reuse configuration. This configuration...

متن کامل

Efficient Large-Scale Multi-Modal Classification

While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantiti...

متن کامل

Operational planning of a large-scale multi-modal transportation system

We describe the operational planning system POP developed for Danzas Euronet, a merger of Deutsche Post Transport and Danzas NTO. As of November 1997, the system has been daily used for the transport planning of on average 4000 container-orders a day on train and vehicles in Germany. Important features include the involvement of many practical constraints as well as the requirement to balance t...

متن کامل

Multi-modal Analysis of Music: A large-scale Evaluation

Multimedia data by definition comprises several different types of content modalities. Music specifically inherits e.g. audio at its core, text in the form of lyrics, images by means of album covers, or video in the form of music videos. Yet, in many Music Information Retrieval applications, only the audio content is utilised. Recent studies have shown the usefulness of incorporating other moda...

متن کامل

A Comprehensive Survey on Cross-modal Retrieval

In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type. For example, a user can use a text to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different m...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: Machine Intelligence Research

سال: 2023

ISSN: ['2731-538X', '2731-5398']

DOI: https://doi.org/10.1007/s11633-022-1410-8